Simultaneous estimation of transcript abundances and transcript-specific fragment distributions of RNA-Seq data with the Mix² model
Abstract
Quantification of RNA transcripts with RNA-Seq is inaccurate due to positional fragmentation bias, which is not represented appropriately by current statistical models of RNA-Seq data. Another, less investigated, source of error is the inaccuracy of transcript start and end annotations. This article introduces the Mix² (read "mixquare") model, which uses a mixture of probability distributions to model the transcript-specific positional fragment bias. The parameters of the Mix² model can be trained efficiently with the EM algorithm and are tied between similar transcripts. Transcript-specific shift and scale parameters allow the Mix² model to correct inaccurate transcript start and end annotations automatically. Experiments are conducted on synthetic data covering 7 genes of different complexity, 4 types of fragment bias, and both correct and incorrect transcript start and end annotations. A comparison of the abundance estimates obtained by Cufflinks 2.2.0, PennSeq and the Mix² model shows superior performance of the Mix² model in the vast majority of test conditions. The Mix² software is available at http://www.lexogen.com/fileadmin/uploads/bioinfo/mix2model.tgz, subject to the enclosed license. Additional experimental data are available in the supplement.
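To illustrate the general idea of fitting a mixture of distributions to positional fragment bias with EM, here is a minimal sketch. It is not the authors' implementation: it assumes Gaussian components with fixed, equally spaced means and a shared standard deviation on a relative transcript coordinate, and estimates only the mixture weights for a single transcript. The function names, the toy data and all parameter values are hypothetical; transcript abundances, parameter tying between transcripts, and the shift and scale parameters are left out.

```python
import math
import random


def gaussian_pdf(x, mean, sd):
    """Density of the normal distribution N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_mixture_weights(positions, means, sd, n_iter=100):
    """Estimate mixture weights by EM; component means and sd stay fixed."""
    k = len(means)
    weights = [1.0 / k] * k  # start from uniform weights
    for _ in range(n_iter):
        counts = [0.0] * k
        for x in positions:
            # E-step: posterior probability of each component for this fragment
            joint = [w * gaussian_pdf(x, m, sd) for w, m in zip(weights, means)]
            total = sum(joint)
            if total == 0.0:
                continue  # fragment too far from every component to matter
            for j in range(k):
                counts[j] += joint[j] / total
        # M-step: new weights are the normalised expected counts
        norm = sum(counts)
        weights = [c / norm for c in counts]
    return weights


# Toy data (assumed): relative fragment start positions biased toward one end
random.seed(0)
means = [0.1, 0.3, 0.5, 0.7, 0.9]
sd = 0.1
true_weights = [0.05, 0.05, 0.1, 0.3, 0.5]
positions = [
    random.gauss(random.choices(means, weights=true_weights)[0], sd)
    for _ in range(2000)
]

weights = em_mixture_weights(positions, means, sd)
print([round(w, 2) for w in weights])
```

The estimated weights form a coarse, smooth profile of where fragments tend to start along the transcript. Keeping the component means and width fixed makes each M-step a simple normalisation, which is one reason EM is cheap for models of this kind.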
Similar resources
Transcript assembly and abundance estimation with high-throughput RNA sequencing
Title of dissertation: TRANSCRIPT ASSEMBLY AND ABUNDANCE ESTIMATION WITH HIGH-THROUGHPUT RNA SEQUENCING Bruce C. Trapnell, Jr., Doctor of Philosophy, 2010 Dissertation directed by: Professor Steven Salzberg Department of Computer Science We present algorithms and statistical methods for the reconstruction and abundance estimation of transcript sequences from high throughput RNA sequencing (“RNA...
An Enumerative Combinatorics Model for Fragmentation Patterns in RNA Sequencing Provides Insights into Nonuniformity of the Expected Fragment Starting-Point and Coverage Profile
RNA sequencing (RNA-seq) has emerged as the method of choice for measuring the expression of RNAs in a given cell population. In most RNA-seq technologies, sequencing the full length of RNA molecules requires fragmentation into smaller pieces. Unfortunately, the issue of nonuniform sequencing coverage across a genomic feature has been a concern in RNA-seq and is attributed to biases for certain...
AtRTD – a comprehensive reference transcript dataset resource for accurate quantification of transcript‐specific expression in Arabidopsis thaliana
RNA-sequencing (RNA-seq) allows global gene expression analysis at the individual transcript level. Accurate quantification of transcript variants generated by alternative splicing (AS) remains a challenge. We have developed a comprehensive, nonredundant Arabidopsis reference transcript dataset (AtRTD) containing over 74 000 transcripts for use with algorithms to quantify AS transcript isoforms...
Network-Based Isoform Quantification with RNA-Seq Data for Cancer Transcriptome Analysis
High-throughput mRNA sequencing (RNA-Seq) is widely used for transcript quantification of gene isoforms. Since RNA-Seq data alone is often not sufficient to accurately identify the read origins from the isoforms for quantification, we propose to explore protein domain-domain interactions as prior knowledge for integrative analysis with RNA-Seq data. We introduce a Network-based method for RNA-S...
P-82: Effect of SCNT Steps on Relative mRNA Abundances of Sheep Oocytes
Background: The oocyte is a unique cell committed to reprogramming the fertilizing sperm and to supporting early stages of embryonic development until the species-specific stage of zygote genome activation, which occurs around the second to third cell cycle in sheep embryos. In this sense, considering the huge list of oocyte transcripts, we selected some candidate genes based on their roles of regulating di...